Statistics about how much data the 1000 Genomes Project produced are accessible in several different ways. Descriptions of some of the file formats used for these statistics are available on the FTP site.
For raw data, a sequence.index file contains base and read counts for each of the active FASTQ files.
During the 1000 Genomes Project, summary statistics were provided in a sequence indices directory, which is now located with the project's historical data. This contains four summary files, two for exome and two for low coverage. Each of these analysis groups has a .stats file, which provides the number of runs withdrawn for different reasons, base count and coverage statistics for each study, and population level summaries, and a stats.csv file, which compares the index with the previous one in terms of number of runs, bases and similar metrics. From late 2012, the 1000 Genomes Project also produced analysis.sequence.index files, which only consider Illumina runs of 70bp read length or longer and which also have statistics files.
For the aligned data, all BAM and CRAM files have BAS files associated with them. These contain read group level statistics for the alignment. We also provide these statistics in collected form in alignment index files. The alignment indices for the alignments of the 1000 Genomes Project data to GRCh38 are available on the FTP site. There is also a historic alignment indices directory, which contains a .hsmetrics file with the results of the Picard tool CalculateHsMetrics for all the exome alignments, as well as summary files that compare statistics between old and new alignment releases during the 1000 Genomes Project.
The 1000 Genomes Project aimed to sequence 2504 individuals in total, using both low coverage whole genome sequencing and exome sequencing. Further samples added to the IGSR will increase this number.
NA12878, the CEU child from our high coverage trio, represents our largest amount of sequence data, with 4.2 Tbases of sequence. The majority of this sequence data is from 2008 and has a short read length (~36bp), so it is not the highest quality we have. You can see how many bases we have sequenced for each of our samples by looking in our sequence index file. The 25th column of this file is the base count in each FASTQ file.
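As a rough illustration of how the per-sample totals could be worked out from the sequence index, the Python sketch below sums the base count column for each sample. It assumes the file is tab-delimited and that the sample name is in the 10th column; only the position of the base count (25th column) comes from the text above, so check the sample column against the format description on the FTP site. The function name bases_per_sample is made up for this example.

```python
import csv
from collections import defaultdict

BASE_COUNT_COL = 24  # 25th column (0-based index 24), as stated in the text
SAMPLE_COL = 9       # ASSUMPTION: 10th column holds the sample name


def bases_per_sample(handle, sample_col=SAMPLE_COL, base_col=BASE_COUNT_COL):
    """Sum the base count of every FASTQ row, grouped by sample name."""
    totals = defaultdict(int)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank or truncated rows.
        if len(row) <= max(sample_col, base_col):
            continue
        value = row[base_col].strip()
        # Skip the header row and rows with no numeric base count.
        if not value.isdigit():
            continue
        totals[row[sample_col]] += int(value)
    return dict(totals)
```

To use it, open a downloaded sequence index with open(path, newline="") and pass the file handle to bases_per_sample; sorting the result by value shows which samples have the most sequence.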
The project has generally used short insert libraries of between 100 and 600bp for Illumina sequence data. For SOLiD and 454 sequence data you will see a wider variety of insert sizes. The insert sizes are reported in both the sequence.index file and the BAS files. The sequence index file contains the insert size reported to the SRA when the data was submitted, whereas the BAS files contain the mean insert size based on the alignment and the standard deviation from that mean.
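The alignment-based insert sizes can be checked against the 100 to 600bp range with a short script. The Python sketch below is only an example: it assumes the BAS file is tab-delimited with a header row, and that the columns are named readgroup, mean_insert_size and insert_size_sd. These names are assumptions, so confirm them against the header of a real BAS file before relying on them. The function name insert_size_outliers is made up for this example.

```python
import csv


def insert_size_outliers(handle, low=100.0, high=600.0):
    """Return (read group, mean insert size, sd) for rows outside [low, high]."""
    outliers = []
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            # ASSUMPTION: these header names match the BAS file in use.
            mean = float(row["mean_insert_size"])
            sd = float(row["insert_size_sd"])
        except (KeyError, TypeError, ValueError):
            continue  # column missing or value not numeric
        if not low <= mean <= high:
            outliers.append((row.get("readgroup", ""), mean, sd))
    return outliers
```

Read groups returned by this function are not necessarily problematic; SOLiD and 454 libraries, for instance, are expected to fall outside the Illumina range.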
As the project started sequencing in 2008, it holds a wide range of read lengths: the Illumina and SOLiD data range from 25bp to 160bp. Our sequence index file reports read and base counts for each FASTQ file, which can be used to work out read lengths more precisely. For the final analysis phase of the project, only Illumina data with reads of 70bp or longer was used and, where required, samples were sequenced again to meet this criterion.
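Dividing the base count by the read count gives the mean read length for each FASTQ file. The Python sketch below shows the arithmetic, assuming a tab-delimited sequence index with the FASTQ file name in the 1st column and the read count in the 24th; only the base count position (25th column) is stated in this document, so treat the other two positions as assumptions. The function names are made up for this example, and the result is an average, so it will not reveal files that mix read lengths.

```python
import csv

FASTQ_COL = 0        # ASSUMPTION: 1st column holds the FASTQ file name
READ_COUNT_COL = 23  # ASSUMPTION: 24th column holds the read count
BASE_COUNT_COL = 24  # 25th column, as stated in the document


def mean_read_lengths(handle, read_col=READ_COUNT_COL, base_col=BASE_COUNT_COL):
    """Map each FASTQ file name to its mean read length (bases / reads)."""
    lengths = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= max(read_col, base_col):
            continue  # blank or truncated row
        reads = row[read_col].strip()
        bases = row[base_col].strip()
        # Skip the header, non-numeric counts, and files with zero reads.
        if not (reads.isdigit() and bases.isdigit()) or int(reads) == 0:
            continue
        lengths[row[FASTQ_COL]] = int(bases) / int(reads)
    return lengths


def at_least(lengths, cutoff=70):
    """Return the sorted names of files whose mean read length meets the cutoff."""
    return sorted(name for name, length in lengths.items() if length >= cutoff)
```

Passing the output of mean_read_lengths to at_least with the default cutoff of 70 gives a first approximation of the files that would meet the read length criterion; it does not check the sequencing platform.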